Human body pose estimation

ABSTRACT

Techniques for human body pose estimation are disclosed herein. Depth map images from a depth camera may be processed to calculate a probability that each pixel of the depth map is associated with one or more segments or body parts of a body. Body parts may then be constructed of the pixels and processed to define joints or nodes of those body parts. The nodes or joints may be provided to a system which may construct a model of the body from the various nodes or joints.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 12/454,628, filed May 20, 2009, which claims the benefit of U.S. Provisional Application No. 61/174,878, titled “Human Body Pose Estimation,” filed May 1, 2009, each of which is hereby incorporated by reference in its entirety.

BACKGROUND

In a typical computing environment, a user has an input device such as a keyboard, a mouse, a joystick or the like, which may be connected to the computing environment by a cable, wire, wireless connection or the like. If control of a computing environment were shifted from a connected controller to gesture or pose based control, the system would need effective techniques to determine what poses or gestures a person is making. Interpreting gestures or poses in a tracking and processing system without knowing the pose of a user's body may cause the system to misinterpret commands, or to miss them altogether.

Further, a user of a tracking and processing system may stand at any of various angles with respect to a capture device, and the user's gesture may appear differently to the capture device depending upon the particular angle of the user with respect to the capture device. For example, if the capture device is unaware that the user is not directly facing the capture device, then the user extending his arm directly forward could be misinterpreted by the capture device as the user extending his arm partially to the left or the right. Thus, the system may not work properly without body pose estimation.

Accordingly, there is a need for technology that allows a tracking and processing system to determine the position of a user's body, and to therefore better interpret the gestures that the user makes.

SUMMARY

Techniques for human body pose estimation are disclosed herein. Depth map images from a depth camera may be processed to calculate a probability that each pixel of the depth map is associated with one or more segments or body parts of a body. Body parts may then be constructed of the pixels and processed to define joints or nodes of those body parts. The nodes or joints may be provided to a system which may construct a model of the body from the various nodes or joints.

In an embodiment, a first pixel of a depth map may be associated with one or more body parts of one or more users. Association with a body part may mean that there is a high probability that the first pixel is located within the body part. This probability may be determined by measuring the background depth, the depth of the first pixel, and the depth of various other pixels around the first pixel.

The location and angle at which various other pixels around the first pixel may be measured for depth may be determined by a feature test training program. In one embodiment, each time the depth at a pixel is measured, a determination of whether the pixel is within the depth range of the body is made. Based on the determination, the distance and angle for the next test pixel may be provided. Selecting the test pixels in such a way may increase the efficiency and robustness of the system.

Body poses, which may include pointing, xyz coordinates, joints, rotation, area, and any other aspects of one or more body parts of a user, may be estimated for multiple users. In an embodiment, this may be accomplished by assuming a user segmentation. For example, values may be assigned to an image such that a value 0 represents background, value 1 represents user 1, value 2 represents user 2, etc. Given this player segmentation image, it is possible to classify all user 1 pixels and do a three dimensional centroid finding, and then repeat this process for subsequent users. In another embodiment, background subtraction may be performed and the remaining foreground pixels (belonging to the multiple users) may then be classified as associated with one or more body parts. In a further embodiment, the background may be considered another ‘body part’ and every pixel in the frame may be considered and associated with one or more body parts, including the background. When computing centroids, it may be ensured that each centroid is spatially localized, so that a respective body part is present for each user. The centroids may then be combined into coherent models by, for example, connecting neighboring body parts throughout each user's body.
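As a concrete illustration of the per-user classification and centroid finding described above, the following sketch (not the patented implementation; the array names, shapes, and probability threshold are assumptions for illustration) loops over the player segmentation image and averages the coordinates of each user's pixels for each body part:

```python
import numpy as np

def per_user_centroids(depth, player_ids, part_probs, min_prob=0.5):
    """Illustrative sketch: depth is an (H, W) depth map, player_ids is a
    player segmentation image (0 = background, 1 = user 1, 2 = user 2, ...),
    and part_probs is an (H, W, num_parts) array of per-pixel body part
    probabilities. Returns a 3-D centroid for each body part of each user."""
    centroids = {}  # (user_id, part_id) -> (x, y, z)
    for user_id in np.unique(player_ids):
        if user_id == 0:        # value 0 represents the background
            continue
        user_mask = player_ids == user_id
        for part_id in range(part_probs.shape[2]):
            # Pixels of this user confidently assigned to this body part.
            mask = user_mask & (part_probs[:, :, part_id] >= min_prob)
            if not mask.any():
                continue
            ys, xs = np.nonzero(mask)
            zs = depth[ys, xs]
            # Average the (x, y, z) coordinates of the selected pixels.
            centroids[(int(user_id), part_id)] = (xs.mean(), ys.mean(), zs.mean())
    return centroids
```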

In an embodiment, after one or more initial body part probabilities are calculated for each pixel, the initial probabilities for each pixel may be compared with the initial probabilities of one or more offset or adjacent pixels to further refine the probability calculations. For example, if the initial probabilities suggest that adjacent pixels are in the same or adjacent body parts (e.g., head and neck), then this would increase the probabilities of the initial calculations. By contrast, if the initial probabilities suggest that adjacent pixels are in non-adjacent body parts (e.g., head and foot), then this would decrease the probabilities of the initial calculations.
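A minimal sketch of such a second refinement pass follows; the part-adjacency matrix, the four-neighbor comparison, and the boost/suppress weights are illustrative assumptions rather than details taken from the disclosure:

```python
import numpy as np

def refine_part_probs(part_probs, adjacency, boost=1.1, suppress=0.9):
    """Second-pass sketch: part_probs has shape (H, W, P); adjacency is a
    (P, P) boolean matrix that is True where two body parts are the same or
    neighboring parts (e.g., head and neck). A pixel's probability for a part
    is boosted when a neighbor most likely belongs to a compatible part, and
    suppressed otherwise."""
    h, w, p = part_probs.shape
    neighbor_best = np.argmax(part_probs, axis=2)   # most likely part per pixel
    refined = part_probs.copy()
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    for y in range(h):
        for x in range(w):
            for dy, dx in offsets:
                ny, nx = y + dy, x + dx
                if not (0 <= ny < h and 0 <= nx < w):
                    continue
                nb_part = neighbor_best[ny, nx]
                compatible = adjacency[:, nb_part]  # parts consistent with the neighbor
                refined[y, x, compatible] *= boost
                refined[y, x, ~compatible] *= suppress
    # Renormalize so each pixel's part probabilities still sum to one.
    refined /= refined.sum(axis=2, keepdims=True)
    return refined
```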

BRIEF DESCRIPTION OF THE DRAWINGS

The file of this patent or application contains at least one drawing/photograph executed in color. Copies of this patent or patent application publication with color drawing(s)/photograph(s) will be provided by the Office upon request and payment of the necessary fee.

The systems, methods, and computer readable media for body pose estimation in accordance with this specification are further described with reference to the accompanying drawings, in which:

FIGS. 1A, 1B, and 1C illustrate an example embodiment of a target recognition, analysis, and tracking system with a user playing a game.

FIG. 2 illustrates an example embodiment of a capture device that may be used in a target recognition, analysis, and tracking system.

FIG. 3 depicts an example embodiment of a depth image.

FIG. 4 illustrates an example embodiment of a computing environment that may be used to interpret one or more poses or gestures in a body pose estimation system.

FIG. 5 illustrates another example embodiment of a computing environment that may be used to interpret one or more poses or gestures in a body pose estimation system.

FIG. 6 depicts a flow diagram of an example method for body pose estimation.

FIG. 7 depicts a flow diagram of an example depth feature test.

FIG. 8 depicts an example embodiment of pixels measured in a depth feature/probability test.

FIG. 9 depicts a flow diagram of an example embodiment of a depth feature/probability test tree.

FIG. 10 depicts an example embodiment of a segmented body used in body pose estimation.

FIG. 11 depicts example embodiments of poses of a user and corresponding segmented images which may be used in a training program to create feature tests.

FIG. 12 depicts an example embodiment of assigning probabilities associated with body parts using multiple feature tests.

FIG. 13 depicts an example embodiment of centroids/joints/nodes of body parts in body pose estimation.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS

As will be described herein, a tracking and processing system may perform body pose estimation. When a user makes a gesture or pose, the tracking and processing system may receive the gesture or pose and associate one or more commands with the user. In order to determine what response to provide to the user of a computing environment, the system may need to be able to determine the body pose of the user. Body poses may also be used to determine skeletal models, determine the location of particular body parts, and the like.

In an example embodiment, a tracking and processing system is provided with a capture device, wherein the capture device comprises a depth camera. The depth camera may capture a depth map of an image scene. The computing environment may perform one or more processes on the depth map to assign pixels on the depth map to segments of the user's body. From these assigned body parts, the computing environment may obtain nodes, centroids or joint positions of the body parts, and may provide the nodes, joints or centroids to one or more processes to create a 3-D model of the body pose. In one aspect, the body pose is the three dimensional location of the set of body parts associated with a user. In another aspect, pose includes the three dimensional location of the body part, as well as the direction it is pointing, the rotation of the body segment or joint, and any other aspects of the body part or segment.

FIGS. 1A and 1B illustrate an example embodiment of a configuration of a tracking and processing system 10 utilizing body pose estimation with a user 18 playing a boxing game. In an example embodiment, the tracking and processing system 10 may be used to, among other things, determine body pose, bind, recognize, analyze, track, associate to a human target, provide feedback, interpret poses or gestures, and/or adapt to aspects of the human target such as the user 18.

As shown in FIG. 1A, the tracking and processing system 10 may include a computing environment 12. The computing environment 12 may be a computer, a gaming system or console, or the like. According to an example embodiment, the computing environment 12 may include hardware components and/or software components such that the computing environment 12 may be used to execute applications such as gaming applications, non-gaming applications, or the like.

As shown in FIG. 1A, the tracking and processing system 10 may further include a capture device 20. The capture device 20 may be, for example, a detector that may be used to monitor one or more users, such as the user 18, such that poses performed by the one or more users may be captured, analyzed, processed, and tracked to perform one or more controls or actions within an application, as will be described in more detail below.

According to one embodiment, the tracking and processing system 10 may be connected to an audiovisual device 16 such as a television, a monitor, a high-definition television (HDTV), or the like that may provide game or application visuals and/or audio to the user 18. For example, the computing environment 12 may include a video adapter such as a graphics card and/or an audio adapter such as a sound card that may provide audiovisual signals associated with the feedback about virtual ports and binding, game application, non-game application, or the like. The audiovisual device 16 may receive the audiovisual signals from the computing environment 12 and may then output the game or application visuals and/or audio associated with the audiovisual signals to the user 18. According to one embodiment, the audiovisual device 16 may be connected to the computing environment 12 via, for example, an S-Video cable, a coaxial cable, an HDMI cable, a DVI cable, a VGA cable, a wireless connection or the like.

As shown in FIGS. 1A and 1B, the tracking and processing system 10 may be used to recognize, analyze, process, determine the pose of, and/or track a human target such as the user 18. For example, the user 18 may be tracked using the capture device 20 such that the position, movements and size of user 18 may be interpreted as controls that may be used to affect the application being executed by computer environment 12. Thus, according to one embodiment, the user 18 may move his or her body to control the application.

As shown in FIGS. 1A and 1B, in an example embodiment, the application executing on the computing environment 12 may be a boxing game that the user 18 may be playing. For example, the computing environment 12 may use the audiovisual device 16 to provide a visual representation of a boxing opponent 22 to the user 18. The computing environment 12 may also use the audiovisual device 16 to provide a visual representation of a user avatar 24 that the user 18 may control with his or her movements on a screen 14. For example, as shown in FIG. 1B, the user 18 may throw a punch in physical space to cause the user avatar 24 to throw a punch in game space. Thus, according to an example embodiment, the computer environment 12 and the capture device 20 of the tracking and processing system 10 may be used to recognize and analyze the punch of the user 18 in physical space such that the punch may be interpreted as a game control of the user avatar 24 in game space.

The user 18 may be associated with a virtual port in computing environment 12. Feedback of the state of the virtual port may be given to the user 18 in the form of a sound or display on audiovisual device 16, a display such as an LED or light bulb, or a speaker on the computing environment 12, or any other means of providing feedback to the user. The feedback may be used to inform a user when he is in a capture area of capture device 20, if he is bound to the tracking and processing system 10, what virtual port he is associated with, and when he has control over an avatar such as avatar 24. Gestures and poses by user 18 may change the state of the system, and thus the feedback that the user receives from the system.

Other movements by the user 18 may also be interpreted as other controls or actions, such as controls to bob, weave, shuffle, block, jab, or throw a variety of different power punches. Furthermore, some movements may be interpreted as controls that may correspond to actions other than controlling the user avatar 24. For example, the user may use movements to enter, exit, turn the system on or off, pause, volunteer, switch virtual ports, save a game, select a level, profile or menu, view high scores, communicate with a friend, etc. Additionally, a full range of motion of the user 18 may be available, used, and analyzed in any suitable manner to interact with an application.

In FIG. 1C, the human target such as the user 18 may have an object such as racket 21. In such embodiments, the user of an electronic game may be holding the object such that the motions of the user and the object may be used to adjust and/or control parameters of the game, such as, for example, hitting an onscreen ball 23. The motion of a user holding a racket 21 may be tracked and utilized for controlling an on-screen racket in an electronic sports game. In another example embodiment, the motion of a user holding an object may be tracked and utilized for controlling an on-screen weapon in an electronic combat game. Any other object may also be included, such as one or more gloves, balls, bats, clubs, guitars, microphones, sticks, pets, animals, drums and the like.

According to other example embodiments, the tracking and processing system 10 may further be used to interpret target movements as operating system and/or application controls that are outside the realm of games. For example, virtually any controllable aspect of an operating system and/or application may be controlled by movements of the target such as the user 18.

As shown in FIG. 2, according to an example embodiment, the image camera component 25 may include an IR light component 26, a three-dimensional (3-D) camera 27, and an RGB camera 28 that may be used to capture the depth image of a scene. For example, in time-of-flight analysis, the IR light component 26 of the capture device 20 may emit an infrared light onto the scene and may then use sensors (not shown) to detect the backscattered light from the surface of one or more targets and objects in the scene using, for example, the 3-D camera 27 and/or the RGB camera 28. In some embodiments, pulsed infrared light may be used such that the time between an outgoing light pulse and a corresponding incoming light pulse may be measured and used to determine a physical distance from the capture device 20 to a particular location on the targets or objects in the scene. Additionally, in other example embodiments, the phase of the outgoing light wave may be compared to the phase of the incoming light wave to determine a phase shift. The phase shift may then be used to determine a physical distance from the capture device to a particular location on the targets or objects.
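The pulsed and phase-shift calculations mentioned above follow the standard time-of-flight relationships. The short sketch below is illustrative only (the function names and units are assumptions), using the usual formulas d = c·Δt/2 for a pulse and d = c·Δφ/(4π·f) for a modulated wave:

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # meters per second

def distance_from_pulse(round_trip_seconds):
    """Pulsed time of flight: the light travels out and back, so the distance
    is half the round-trip time multiplied by the speed of light."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0

def distance_from_phase(phase_shift_radians, modulation_hz):
    """Phase-based time of flight: a wave modulated at modulation_hz picks up
    a phase shift proportional to the round trip, d = c * phi / (4 * pi * f)."""
    return SPEED_OF_LIGHT * phase_shift_radians / (4.0 * math.pi * modulation_hz)
```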

According to another example embodiment, time-of-flight analysis may be used to indirectly determine a physical distance from the capture device 20 to a particular location on the targets or objects by analyzing the intensity of the reflected beam of light over time via various techniques including, for example, shuttered light pulse imaging.

In another example embodiment, the capture device 20 may use structured light to capture depth information. In such an analysis, patterned light (i.e., light displayed as a known pattern such as a grid pattern or a stripe pattern) may be projected onto the scene via, for example, the IR light component 26. Upon striking the surface of one or more targets or objects in the scene, the pattern may become deformed in response. Such a deformation of the pattern may be captured by, for example, the 3-D camera 27 and/or the RGB camera 28 and may then be analyzed to determine a physical distance from the capture device to a particular location on the targets or objects.

According to another embodiment, the capture device 20 may include two or more physically separated cameras that may view a scene from different angles, to obtain visual stereo data that may be resolved to generate depth information. Depth may also be determined by capturing images using one or more detectors that may be monochromatic, infrared, RGB or any other type of detector and performing a parallax calculation.

The capture device 20 may further include a microphone 30. The microphone 30 may include a transducer or sensor that may receive and convert sound into an electrical signal. According to one embodiment, the microphone 30 may be used to reduce feedback between the capture device 20 and the computing environment 12 in the tracking and processing system 10. Additionally, the microphone 30 may be used to receive audio signals that may also be provided by the user to control applications such as game applications, non-game applications, or the like that may be executed by the computing environment 12.

The capture device 20 may further include a feedback component 31. The feedback component 31 may comprise a light such as an LED or a light bulb, a speaker or the like. The feedback device may perform at least one of changing colors, turning on or off, increasing or decreasing in brightness, and flashing at varying speeds. The feedback component 31 may also comprise a speaker which may provide one or more sounds or noises as a feedback of one or more states. The feedback component may also work in combination with computing environment 12 or processor 32 to provide one or more forms of feedback to a user by means of any other element of the capture device, the tracking and processing system or the like.

In an example embodiment, the capture device 20 may further include a processor 32 that may be in operative communication with the image camera component 25. The processor 32 may include a standardized processor, a specialized processor, a microprocessor, or the like that may execute instructions that may include instructions for receiving the depth image, determining whether a suitable target may be included in the depth image, converting the suitable target into a skeletal representation or model of the target, determining the body pose, or any other suitable instruction.

The capture device 20 may further include a memory component 34 that may store the instructions that may be executed by the processor 32, images or frames of images captured by the 3-D camera or RGB camera, user profiles or any other suitable information, images, or the like. According to an example embodiment, the memory component 34 may include random access memory (RAM), read only memory (ROM), cache, Flash memory, a hard disk, or any other suitable storage component. As shown in FIG. 2, in one embodiment, the memory component 34 may be a separate component in communication with the image capture component 25 and the processor 32. According to another embodiment, the memory component 34 may be integrated into the processor 32 and/or the image capture component 25.

As shown in FIG. 2, the capture device 20 may be in communication with the computing environment 12 via a communication link 36. The communication link 36 may be a wired connection including, for example, a USB connection, a Firewire connection, an Ethernet cable connection, or the like and/or a wireless connection such as a wireless 802.11b, g, a, or n connection. According to one embodiment, the computing environment 12 may provide a clock to the capture device 20 that may be used to determine when to capture, for example, a scene via the communication link 36.

Additionally, the capture device 20 may provide the depth information and images captured by, for example, the 3-D camera 27 and/or the RGB camera 28, and a skeletal model that may be generated by the capture device 20 or the computing environment to the computing environment 12 via the communication link 36. The computing environment 12 may then use the skeletal model, depth information, and captured images to, for example, create a virtual screen, adapt the user interface and control an application such as a game or word processor. For example, as shown in FIG. 2, the computing environment 12 may include a gestures library 190. The gestures library 190 may include a collection of gesture filters, each comprising information concerning a gesture that may be performed by the skeletal model (as the user moves). The data captured by the cameras 26, 27 and device 20 in the form of the skeletal model and movements associated with it may be compared to the gesture filters in the gestures library 190 to identify when a user (as represented by the skeletal model) has performed one or more gestures. Those gestures or poses may be associated with various controls of an application. Thus, the computing environment 12 may use the gestures library 190 to interpret movements of the skeletal model and to control an application based on the movements.

FIG. 3 illustrates an example embodiment of a depth image 60 that may be received by the tracking and processing system and/or the computing environment. According to an example embodiment, the depth image 60 may be an image or frame of a scene captured by, for example, the 3-D camera 27 and/or the RGB camera 28 of the capture device 20 described above with respect to FIG. 2. As shown in FIG. 3, the depth image 60 may include a human target 62 and one or more non-human targets 64 such as a wall, a table, a monitor, or the like in the captured scene. As described above, the depth image 60 may include a plurality of observed pixels where each observed pixel has an observed depth value associated therewith. For example, the depth image 60 may include a two-dimensional (2-D) pixel area of the captured scene where each pixel in the 2-D pixel area may represent a depth value such as a length or distance in, for example, centimeters, millimeters, or the like of a target or object in the captured scene from the capture device.

According to one embodiment, a depth image such as depth image 60 or an image on an RGB camera such as camera 28, or an image on any other detector may be processed and used to determine the shape and size of a target. In another embodiment, the depth image 60 may be used to determine the body pose of a user. The body may be divided into a series of segments and each pixel of a depth map 60 may be assigned a probability that it is associated with each segment. This information may be provided to one or more processes which may determine the location of nodes, joints, centroids or the like to determine a skeletal model and interpret the motions of a user 62 for pose or gesture based command.

Referring back to FIG. 2, in one embodiment, upon receiving the depth image, the depth image may be downsampled to a lower processing resolution such that the depth image may be more easily used and/or more quickly processed with less computing overhead. Additionally, one or more high-variance and/or noisy depth values may be removed and/or smoothed from the depth image; portions of missing and/or removed depth information may be filled in and/or reconstructed; and/or any other suitable processing may be performed on the received depth information such that the depth information may be used to size a virtual screen on a user as described above.

FIG. 4 illustrates an example embodiment of a computing environment that may be used to interpret one or more gestures in a target recognition, analysis, and tracking system. The computing environment such as the computing environment 12 described above with respect to FIGS. 1A-2 may be a multimedia console 100, such as a gaming console. As shown in FIG. 4, the multimedia console 100 has a central processing unit (CPU) 101 having a level 1 cache 102, a level 2 cache 104, and a flash ROM (Read Only Memory) 106. The level 1 cache 102 and a level 2 cache 104 temporarily store data and hence reduce the number of memory access cycles, thereby improving processing speed and throughput. The CPU 101 may be provided having more than one core, and thus, additional level 1 and level 2 caches 102 and 104. The flash ROM 106 may store executable code that is loaded during an initial phase of a boot process when the multimedia console 100 is powered ON.

A graphics processing unit (GPU) 108 and a video encoder/video codec (coder/decoder) 114 form a video processing pipeline for high speed and high resolution graphics processing. Data is carried from the graphics processing unit 108 to the video encoder/video codec 114 via a bus as well as to the CPU. The video processing pipeline outputs data to an A/V (audio/video) port 140 for transmission to a television or other display. A memory controller 110 is connected to the GPU 108 to facilitate processor access to various types of memory 112, such as, but not limited to, a RAM (Random Access Memory).

The multimedia console 100 includes an I/O controller 120, a system management controller 122, an audio processing unit 123, a network interface controller 124, a first USB host controller 126, a second USB controller 128 and a front panel I/O subassembly 130 that are preferably implemented on a module 118. The USB controllers 126 and 128 serve as hosts for peripheral controllers 142(1)-142(2), a wireless adapter 148, and an external memory device 146 (e.g., flash memory, external CD/DVD ROM drive, removable media, etc.). The network interface 124 and/or wireless adapter 148 provide access to a network (e.g., the Internet, home network, etc.) and may be any of a wide variety of various wired or wireless adapter components including an Ethernet card, a modem, a Bluetooth module, a cable modem, and the like.

System memory 143 is provided to store application data that is loaded during the boot process. A media drive 144 is provided and may comprise a DVD/CD drive, hard drive, or other removable media drive, etc. The media drive 144 may be internal or external to the multimedia console 100. Application data may be accessed via the media drive 144 for execution, playback, etc. by the multimedia console 100. The media drive 144 is connected to the I/O controller 120 via a bus, such as a Serial ATA bus or other high speed connection (e.g., IEEE 1394).

The system management controller 122 provides a variety of service functions related to assuring availability of the multimedia console 100. The audio processing unit 123 and an audio codec 132 form a corresponding audio processing pipeline with high fidelity and stereo processing. Audio data is carried between the audio processing unit 123 and the audio codec 132 via a communication link. The audio processing pipeline outputs data to the A/V port 140 for reproduction by an external audio player or device having audio capabilities.

The front panel I/O subassembly 130 supports the functionality of the power button 150 and the eject button 152, as well as any LEDs (light emitting diodes) or other indicators exposed on the outer surface of the multimedia console 100. A system power supply module 136 provides power to the components of the multimedia console 100. A fan 138 cools the circuitry within the multimedia console 100.

The front panel I/O subassembly 130 may include LEDs, a visual display screen, light bulbs, a speaker or any other means that may provide audio or visual feedback of the state of control of the multimedia console 100 to a user 18. For example, if the system is in a state where no users are detected by capture device 20, such a state may be reflected on front panel I/O subassembly 130. If the state of the system changes, for example, a user becomes bound to the system, the feedback state may be updated on the front panel I/O subassembly to reflect the change in states.

The CPU 101, GPU 108, memory controller 110, and various other components within the multimedia console 100 are interconnected via one or more buses, including serial and parallel buses, a memory bus, a peripheral bus, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures can include a Peripheral Component Interconnects (PCI) bus, PCI-Express bus, etc.

When the multimedia console 100 is powered ON, application data may be loaded from the system memory 143 into memory 112 and/or caches 102, 104 and executed on the CPU 101. The application may present a graphical user interface that provides a consistent user experience when navigating to different media types available on the multimedia console 100. In operation, applications and/or other media contained within the media drive 144 may be launched or played from the media drive 144 to provide additional functionalities to the multimedia console 100.

The multimedia console 100 may be operated as a standalone system by simply connecting the system to a television or other display. In this standalone mode, the multimedia console 100 allows one or more users to interact with the system, watch movies, or listen to music. However, with the integration of broadband connectivity made available through the network interface 124 or the wireless adapter 148, the multimedia console 100 may further be operated as a participant in a larger network community.

When the multimedia console 100 is powered ON, a set amount of hardware resources are reserved for system use by the multimedia console operating system. These resources may include a reservation of memory (e.g., 16 MB), CPU and GPU cycles (e.g., 5%), networking bandwidth (e.g., 8 kbps), etc. Because these resources are reserved at system boot time, the reserved resources do not exist from the application's view.

In particular, the memory reservation preferably is large enough to contain the launch kernel, concurrent system applications and drivers. The CPU reservation is preferably constant such that if the reserved CPU usage is not used by the system applications, an idle thread will consume any unused cycles.

With regard to the GPU reservation, lightweight messages generated by the system applications (e.g., popups) are displayed by using a GPU interrupt to schedule code to render the popup into an overlay. The amount of memory required for an overlay depends on the overlay area size, and the overlay preferably scales with screen resolution. Where a full user interface is used by the concurrent system application, it is preferable to use a resolution independent of application resolution. A scaler may be used to set this resolution such that the need to change frequency and cause a TV resynch is eliminated.

After the multimedia console 100 boots and system resources are reserved, concurrent system applications execute to provide system functionalities. The system functionalities are encapsulated in a set of system applications that execute within the reserved system resources described above. The operating system kernel identifies threads that are system application threads versus gaming application threads. The system applications are preferably scheduled to run on the CPU 101 at predetermined times and intervals in order to provide a consistent system resource view to the application. The scheduling is to minimize cache disruption for the gaming application running on the console.

When a concurrent system application requires audio, audio processing is scheduled asynchronously to the gaming application due to time sensitivity. A multimedia console application manager (described below) controls the gaming application audio level (e.g., mute, attenuate) when system applications are active.

Input devices (e.g., controllers 142(1) and 142(2)) are shared by gaming applications and system applications. The input devices are not reserved resources, but are to be switched between system applications and the gaming application such that each will have a focus of the device. The application manager preferably controls the switching of the input stream, without the gaming application's knowledge, and a driver maintains state information regarding focus switches. The cameras 27, 28 and capture device 20 may define additional input devices for the console 100.

FIG. 5 illustrates another example embodiment of a computing environment that may be the computing environment 12 shown in FIGS. 1A-2 used to interpret one or more poses or gestures in a tracking and processing system. The computing system environment of FIG. 5 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the presently disclosed subject matter. Neither should the computing environment 12 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment of FIG. 5. In some embodiments the various depicted computing elements may include circuitry configured to instantiate specific aspects of the present disclosure. For example, the term circuitry used in the disclosure can include specialized hardware components configured to perform function(s) by firmware or switches. In other example embodiments the term circuitry can include a general purpose processing unit, memory, etc., configured by software instructions that embody logic operable to perform function(s). In example embodiments where circuitry includes a combination of hardware and software, an implementer may write source code embodying logic and the source code can be compiled into machine readable code that can be processed by the general purpose processing unit. Since one skilled in the art can appreciate that the state of the art has evolved to a point where there is little difference between hardware, software, or a combination of hardware/software, the selection of hardware versus software to effectuate specific functions is a design choice left to an implementer. More specifically, one of skill in the art can appreciate that a software process can be transformed into an equivalent hardware structure, and a hardware structure can itself be transformed into an equivalent software process. Thus, the selection of a hardware implementation versus a software implementation is one of design choice and left to the implementer.

In FIG. 5, the computing environment comprises a computer 241, which typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 241 and includes both volatile and nonvolatile media, removable and non-removable media. The system memory 222 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 223 and random access memory (RAM) 260. A basic input/output system 224 (BIOS), containing the basic routines that help to transfer information between elements within computer 241, such as during start-up, is typically stored in ROM 223. RAM 260 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 259. By way of example, and not limitation, FIG. 5 illustrates operating system 225, application programs 226, other program modules 227, and program data 228.

The computer 241 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 5 illustrates a hard disk drive 238 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 239 that reads from or writes to a removable, nonvolatile magnetic disk 254, and an optical disk drive 240 that reads from or writes to a removable, nonvolatile optical disk 253 such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 238 is typically connected to the system bus 221 through a non-removable memory interface such as interface 234, and magnetic disk drive 239 and optical disk drive 240 are typically connected to the system bus 221 by a removable memory interface, such as interface 235.

The drives and their associated computer storage media discussed above and illustrated in FIG. 5 provide storage of computer readable instructions, data structures, program modules and other data for the computer 241. In FIG. 5, for example, hard disk drive 238 is illustrated as storing operating system 258, application programs 257, other program modules 256, and program data 255. Note that these components can either be the same as or different from operating system 225, application programs 226, other program modules 227, and program data 228. Operating system 258, application programs 257, other program modules 256, and program data 255 are given different numbers here to illustrate that, at a minimum, they are different copies. A user may enter commands and information into the computer 241 through input devices such as a keyboard 251 and pointing device 252, commonly referred to as a mouse, trackball or touch pad. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 259 through a user input interface 236 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). The cameras 27, 28 and capture device 20 may define additional input devices for the console 100. A monitor 242 or other type of display device is also connected to the system bus 221 via an interface, such as a video interface 232. In addition to the monitor, computers may also include other peripheral output devices such as speakers 244 and printer 243, which may be connected through an output peripheral interface 233.

The computer 241 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 246. The remote computer 246 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 241, although only a memory storage device 247 has been illustrated in FIG. 5. The logical connections depicted in FIG. 5 include a local area network (LAN) 245 and a wide area network (WAN) 249, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.

When used in a LAN networking environment, the computer 241 is connected to the LAN 245 through a network interface or adapter 237. When used in a WAN networking environment, the computer 241 typically includes a modem 250 or other means for establishing communications over the WAN 249, such as the Internet. The modem 250, which may be internal or external, may be connected to the system bus 221 via the user input interface 236, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 241, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 5 illustrates remote application programs 248 as residing on memory device 247. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

FIG. 6 depicts a block diagram 300 whereby body pose estimation may be performed. In one embodiment, at 302, a depth map such as depth map 60 may be received by the tracking and processing system. Probabilities associated with one or more virtual body parts may be assigned to pixels on a depth map at 304. A centroid may be calculated for sets of associated pixels associated with a virtual body part, which may be a node, joint or centroid at 306. Centroids may be representations of joints or nodes of a body, and may be calculated using any mathematical algorithm, including, for example, averaging the coordinates of every pixel in a depth map having a threshold probability that it is associated with a body part, or, as another example, a linear regression technique. At 308, the various nodes, joints or centroids associated with the body parts may be combined into a model, which may be provided to one or more programs in a tracking and processing system. The model may include not only the location in three dimensions of the joints or body parts, but may also include the rotation of a joint or any other information about the pointing of the body part.
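The averaging described for 306 can be pictured with the following sketch; the probability threshold and the probability weighting of each pixel are illustrative assumptions:

```python
import numpy as np

def body_part_centroid(depth, part_prob, threshold=0.5):
    """Average the (x, y, z) coordinates of every pixel whose probability of
    belonging to the body part meets the threshold, weighting each pixel by
    that probability. Returns None if no pixel qualifies."""
    ys, xs = np.nonzero(part_prob >= threshold)
    if xs.size == 0:
        return None
    weights = part_prob[ys, xs]
    zs = depth[ys, xs]
    return (np.average(xs, weights=weights),
            np.average(ys, weights=weights),
            np.average(zs, weights=weights))
```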

Body poses may be estimated for multiple users. In an embodiment, this may be accomplished by assuming a user segmentation. For example, values may be assigned to an image such that a value 0 represents background, value 1 represents user 1, value 2 represents user 2, etc. Given this player segmentation image, it is possible to classify all user 1 pixels and do a centroid finding, and then repeat this process for subsequent users. In another embodiment, background subtraction may be performed and the remaining foreground pixels (belonging to the multiple users) may then be classified. When computing centroids, it may be ensured that each centroid is spatially localized, so that a respective body part is present for each user. The centroids may then be combined into coherent models by, for example, connecting neighboring body parts throughout each user's body.

FIG. 7 depicts a sample flow chart for assigning probabilities associated with virtual body parts to a depth map. In an example embodiment, the process of FIG. 7 may be performed at 304 of FIG. 6. Process 350 may employ a depth map received at 302 to assign probabilities associated with virtual body parts at 304. One or more background depths on a depth map may be established at 352. For example, one background depth may correspond to a wall in the back of a room; other background depths may correspond to other humans or objects in the room. These background depths may be used later in the flowchart of FIG. 7 to determine if a pixel on the depth map is part of a particular user's body or whether the pixel may be associated with the background.

At 353, a first location may be selected in the depth map. The depth of the first location may be determined at 354. At 356, the depth of the first location may be compared with one or more background depths. If the first location depth is the same as, or within a specified threshold range of, a background depth, then, at 358, the first location is determined to be part of the background and not part of any body parts. If the first location is not at or within a specified threshold range of a background depth, an offset location, referenced with respect to the first location, may be selected at 360. At 362, the depth of the offset location may be determined and a depth test may be performed to determine if the offset location is background. At 354, it is determined whether any additional offset locations are desired.

The determination of whether or not to select additional offset locations, as well as the angle and distance of the additional offset locations from the first location, may be made based in part on the depth of the previous offset location(s) with respect to the first location and/or the background. These determinations may also be made based on additional factors such as the training module described below. In one embodiment, the offsets will scale with depth. For example, if a user is very close to a detector in a capture area, depth may be measured at large offset distances from the first pixel. If the user were to move twice as far from the detector, then the offset distances may decrease by a factor of two. In one embodiment, this scaling causes the depth offset tests to be invariant to the user's distance from the detector. Any number of offset locations may be selected and depth tested, after which a probability that the first location is associated with one or more body parts is calculated at 366. This calculation may be based in part on the depth of the first location and the offset locations with respect to the one or more background depths. This calculation may also be made based on additional factors such as the training module described below.
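A minimal sketch of such a depth-scaled offset test is shown below. It is illustrative only: the convention that offsets are expressed at a reference depth of one meter, the millimeter units, and the background comparison are assumptions, but it shows how dividing the offset by the first pixel's depth keeps the probed physical displacement constant as the user moves nearer or farther:

```python
def offset_depth_test(depth, y, x, offset_mm, background_depth, threshold_mm=100.0):
    """Probe the depth map at an offset from pixel (y, x), scaling the offset
    by the depth of the first pixel so the test is invariant to the user's
    distance from the camera.

    depth            -- 2-D array of depth values in millimeters
    offset_mm        -- (dy, dx) displacement expressed at a reference depth of 1 m
    background_depth -- depth treated as background for this scene
    Returns True if the offset pixel appears to lie on the body."""
    z = depth[y][x]
    if z <= 0:
        return False
    # Offsets shrink as the user moves away: twice the distance, half the pixels.
    scale = 1000.0 / z
    oy = int(round(y + offset_mm[0] * scale))
    ox = int(round(x + offset_mm[1] * scale))
    if not (0 <= oy < len(depth) and 0 <= ox < len(depth[0])):
        return False  # probing off the image counts as a failed test
    return abs(depth[oy][ox] - background_depth) > threshold_mm
```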

In another embodiment, 352 may not be performed. In this embodiment, each pixel in a depth map is examined for depth at 354, and then the method proceeds directly to choosing offset locations at 360. In such an example, every pixel in a depth map may be examined for depth or for the probability that it is associated with one or more body parts and/or background. From the determinations made at the first pixel and the offset locations, probabilities may be associated with one or more pixels.

FIG. 8 depicts an instance of the flow chart referenced in FIG. 7. In the flow chart of FIG. 7, a series of feature tests may be used to determine the probability that a pixel in a depth map is associated with one or more body parts. A first location pixel is selected at 480. A first offset pixel is examined at 482, and a second offset pixel is examined at 484. As more pixels are examined for depth, the probability that a particular pixel is associated with a part of the body may decrease or increase. This probability may be provided to other processes in a tracking and processing system.

In another example depicted by FIG. 8, a first location pixel of a depth map is selected at 480, wherein the depth map has probabilities that each pixel in the depth map is associated with one or more body parts already assigned to each pixel. A second offset pixel is examined for its associated probability at 484. As more pixels are examined for their associated probabilities, a second pass at the probability associated with the first pixel may provide a more accurate determination of the body part associated with the pixel. This probability may be provided to other processes in a tracking and processing system.

FIG. 9 depicts a flow chart of another example implementation of feature testing in body pose estimation. A depth map is received and a first pixel location is selected at 502. This may be the pixel depicted at FIG. 8 as the first location. If the first pixel is at the background depth, then probabilities associated with each body part may be zero. If, however, the first pixel is not at the background depth, an angle and distance to a second pixel may be selected at 504.

In another embodiment, a background depth is not determined; instead, depth tests and the surrounding offset depth tree tests may be performed at each pixel, regardless of its depth.

In another embodiment, the depth map received at 502 already has the probability that each pixel is associated with one or more body parts assigned to each pixel. Accordingly, instead of testing depth at the first pixel and at offset locations, the probabilities may be tested.

A depth/probability test may be performed on the second pixel at 506. If the second pixel fails the depth/probability test (i.e., it is at the background depth/probability, at the depth/probability of a second user, not within the range of a user's body, or the like), then location F-1 is selected at 510. If, however, the second pixel passes the depth/probability test (i.e., it is within a threshold of the body depth/probability), then location P-1 is selected at 508. Depth/probability tests will then be performed on third pixels at 508 or 510, and based on whether the third pixels pass or fail the depth/probability test, other pixel locations will be selected at one of 512, 514, 516 or 518. While these locations may, in some cases, be the same, they may also vary widely in location based on the results of the depth/probability tests.

In an example embodiment, depth/probability tests on any number of pixels may be performed with reference to a single pixel. For example, 16 tests may be performed, where each depth/probability test is at a different pixel. By performing some quantity of depth/probability tests, the probability that a pixel is associated with each body part may be assigned to each pixel. As another example, only one test may need to be performed on a particular pixel in order to determine the probability that it is associated with one or more body parts.
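The branching described for FIG. 9 can be pictured as a binary decision tree in which each internal node holds one offset test, a pass takes one child and a fail takes the other, and each leaf holds a probability for every body part. The sketch below is illustrative and reuses the hypothetical offset_depth_test helper from the earlier sketch:

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

@dataclass
class FeatureNode:
    """One node of a feature test tree. Internal nodes carry an offset test
    and two children; leaves carry a probability for every body part."""
    offset_mm: Optional[Tuple[float, float]] = None    # probed from the current pixel
    pass_child: Optional["FeatureNode"] = None          # followed when the test passes (P-1, ...)
    fail_child: Optional["FeatureNode"] = None          # followed when the test fails (F-1, ...)
    leaf_probs: Optional[Sequence[float]] = None        # probability per body part at a leaf

def classify_pixel(tree, depth, y, x, background_depth):
    """Walk the tree from the root, performing one offset depth test per node,
    until a leaf's body part probabilities are reached."""
    node = tree
    while node.leaf_probs is None:
        passed = offset_depth_test(depth, y, x, node.offset_mm, background_depth)
        node = node.pass_child if passed else node.fail_child
    return node.leaf_probs
```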

FIG. 10 depicts an example image that may come from a capture device, such as capture device 20, a graphics package, or other 3-D rendering, along with a segmented body image of the example image. Original image 550 may be a depth map or other image from the capture device. In an example embodiment, the image of a body may be segmented into many parts as in segmented image 552, and each pixel in a depth map may be associated with a probability for each of the segments in FIG. 10. This probability may be determined using the methods, processes and systems described with respect to FIGS. 7, 8 and 9.

FIG. 11 depicts a series of images of poses from one or more users. For each pose, an image that may be received from a capture device such as capture device 20 is shown adjacent to an image of the pose that has been segmented into parts.

In a first embodiment, the tracking and processing system may receive the non-segmented images 602, 606, 610, and 614, and use the processes described at FIGS. 7, 8 and 9 to determine the probability that each pixel in the image is associated with each of the segmented body parts. The purpose of the processes described in FIGS. 7, 8 and 9 may be to segment the body into each of the parts shown at 604, 608, 612 and 616. These segmented parts may be used by one or more computer processes to determine the body pose of the user.

In a second embodiment, these images may be used in a feature test training module to determine the feature tests of FIGS. 7, 8, and 9. Recall from FIGS. 7, 8, and 9 that a depth test may be performed on a pixel, and it either passes or fails, and based on the pass or fail, a next location will be selected. In one embodiment, the next location selected is not arbitrary, but is selected based on a training module. A training module may involve inputting a volume of thousands, hundreds of thousands, millions or any number of segmented poses such as those shown in FIG. 11 into a program. The program may perform one or more operations on the volume of poses to determine optimal feature tests for each pass or fail for the full volume, or some selection of poses. This optimized series of feature tests may be known as feature test trees.

A volume of poses input into a feature test training module may not contain every possible pose by a user. Further, it may increase the efficiency of the program to create several feature test training modules, each of which is based on a separate volume of body poses. Accordingly, the feature tests at each step of a feature test tree may be different, and the final probabilities associated with each segment of a body at the conclusion of a test tree may also be different. In one embodiment, several feature test trees are provided for each pixel and the probabilities output from each test tree may be averaged or otherwise combined to provide a segmented image of a body pose.
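A minimal sketch of combining several trees follows, reusing the hypothetical classify_pixel helper above; simple averaging of the per-part probability votes is one plausible combination, not necessarily the one used:

```python
import numpy as np

def ensemble_pixel_probs(trees, depth, y, x, background_depth):
    """Average the per-body-part probabilities produced by several feature
    test trees for one pixel; each tree was trained on its own volume of
    segmented poses, so averaging their votes smooths out individual biases."""
    votes = [classify_pixel(tree, depth, y, x, background_depth) for tree in trees]
    return np.mean(np.asarray(votes, dtype=float), axis=0)
```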

FIG. 12 depicts an example flow chart to determine body segment probabilities associated with each pixel in human body pose estimation. At 650 a depth map such as the depth map shown in FIG. 3 may be received from a capture device 20. This depth map may be provided to a series of feature test trees at 652. In FIG. 12, three feature test trees, each having been trained on a different volume of body poses, test each pixel of a depth map. The probability that each pixel is associated with each segment of the body is determined at 654 as the soft body parts. In an example embodiment, the process stops here and these probabilities may be used to obtain the joints/nodes/centroids of FIG. 6 at 306.

In another embodiment, at 656, the depth map may again be provided to a series of feature test trees, each of which may have been created using a different volume of body pose images. In FIG. 12, this second series of feature tests contains three trees, each of which may output a probability for each pixel of the depth map associated with each segment of a body. At 658, the probabilities from the second set of feature test trees 656 and the soft body parts from 654 may be combined by averaging or some other method to determine the second pass of the body parts. FIG. 12 shows two sets of three feature test trees; however, the number of feature test trees is not limited to three, nor is the number of passes limited by FIG. 12. There may be any number of feature test trees and any number of passes.

In another embodiment, at 656, the depth map provided to the series of feature test trees may have the probability that each pixel of a depth map is associated with one or more body parts already associated with each pixel. For example, the probability maps determined by the feature test trees at 652 may be provided to the feature test trees at 656. In such a circumstance, instead of depth test training programs and trees, the system instead utilizes probability test training programs and trees. The number of trees and passes is not limited in any way, and the trees may be any combination of depth and probability feature tests.

FIG. 13 depicts a segmented body pose image wherein each segment contains a node/joint/centroid, such as those described at 306 with reference to FIG. 6. These joints/nodes/centroids may be determined by taking the centroid of all of the pixels associated with a body part segment after performing the feature tests of FIGS. 7, 8, 9, and 12. Other methods may also be used to determine the location of the nodes/centroids/joints. For example, a filtering process may remove outlying pixels or the like, after which a process may take place to determine the location of the joints/nodes/centroids.

The joints/nodes/centroids of FIG. 13 may be used to construct a skeletal model, or otherwise represent the body pose of a user. This model may be used by the tracking and processing system in any way, including determining the commands of one or more users, identifying one or more users and the like.
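One simple way to assemble such a model is to connect the centroids of neighboring body parts into a list of bones; the part names and connections below are illustrative assumptions, not the disclosure's segmentation:

```python
# Hypothetical body part names and the neighboring parts they connect to.
SKELETON_EDGES = [
    ("head", "neck"), ("neck", "torso"),
    ("neck", "left_shoulder"), ("left_shoulder", "left_elbow"), ("left_elbow", "left_hand"),
    ("neck", "right_shoulder"), ("right_shoulder", "right_elbow"), ("right_elbow", "right_hand"),
    ("torso", "left_hip"), ("left_hip", "left_knee"), ("left_knee", "left_foot"),
    ("torso", "right_hip"), ("right_hip", "right_knee"), ("right_knee", "right_foot"),
]

def build_skeleton(centroids):
    """Connect the 3-D centroids of neighboring body parts into a simple
    skeletal model: a list of (part_a, part_b, position_a, position_b) bones.
    Parts whose centroids could not be computed are simply skipped."""
    bones = []
    for a, b in SKELETON_EDGES:
        if a in centroids and b in centroids:
            bones.append((a, b, centroids[a], centroids[b]))
    return bones
```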

It should be understood that the configurations and/or approaches described herein are exemplary in nature, and that these specific embodiments or examples are not to be considered limiting. The specific routines or methods described herein may represent one or more of any number of processing strategies. As such, various acts illustrated may be performed in the sequence illustrated, in other sequences, in parallel, or the like. Likewise, the order of the above-described processes may be changed.

Additionally, the subject matter of the present disclosure includes combinations and subcombinations of the various processes, systems and configurations, and other features, functions, acts, and/or properties disclosed herein, as well as equivalents thereof.

What is claimed:
1. A method for determining a position of a body, each segment being associated with a portion of the body, the method comprising: receiving a depth image that includes at least part of the body; determining that a first pixel in the depth image corresponds to the body; and determining that a second pixel in the depth image corresponds to the body based on an angle and a distance of the second pixel relative to the first pixel.
2. The method of claim 1, wherein determining that the first pixel in the depth image corresponds to the body comprises determining a first probability that the first pixel in the depth image corresponds to the body, and wherein determining that the second pixel in the depth image corresponds to the body comprises: determining that the second pixel in the depth image corresponds to the body based on the angle and the distance of the second pixel relative to the first pixel, and the first probability.
3. The method of claim 2, further comprising: determining that a third pixel in the depth image has a zero probability of corresponding to the body based on the third pixel having a depth value associated with a background.
4. The method of claim 1, further comprising: selecting a third pixel among a plurality of pixels in the depth image for determination of whether the third pixel corresponds to the body based on determining that the second pixel in the depth image corresponds to the body, and based on an angle and a distance of the third pixel relative to the second pixel.
5. The method of claim 1, further comprising: determining the angle and the distance of the second pixel relative to the first pixel based on a decision tree.
6. The method of claim 1, further comprising: determining the angle and the distance of the second pixel relative to the first pixel invariantly with respect to depth.
7. The method of claim 1, wherein determining that the second pixel in the depth image corresponds to the body comprises: determining that the second pixel in the depth image corresponds to the body based on the angle and the distance of the second pixel relative to the first pixel, and a depth value of the first pixel.
8. A system for determining a position of a body, each segment being associated with a portion of the body, comprising: a memory bearing instructions that, upon execution by a processor, cause the system at least to: receive a depth image that includes at least part of the body; determine that a first pixel in the depth image corresponds to the body; and determine that a second pixel in the depth image corresponds to the body based on an angle and a distance of the second pixel relative to the first pixel.
9. The system of claim 8, wherein the memory further bears instructions that, upon execution by the processor, cause the system at least to: select the second pixel among a plurality of pixels in the depth image for determination of whether the second pixel corresponds to the body based on determining that the angle and distance of the second pixel relative to the first pixel is below a threshold amount.
10. The system of claim 9, wherein the threshold amount would increase if a depth value associated with the first pixel were to decrease.
11. The system of claim 8, wherein the memory further bears instructions that, upon execution by the processor, cause the system at least to: determine a centroid pixel of the body based on the first pixel and the second pixel.
12. The system of claim 11, wherein the memory further bears instructions that, upon execution by the processor, cause the system at least to: determine a location of a joint of the body based on the centroid pixel.
13. The system of claim 8, wherein the instructions that, upon execution by the processor, cause the system at least to determine that the first pixel in the depth image corresponds to the body further cause the system at least to determine a first probability that the first pixel in the depth image corresponds to the body, and wherein the instructions that, upon execution by the processor, cause the system at least to determine that the second pixel in the depth image corresponds to the body further cause the system at least to: determine that the second pixel in the depth image corresponds to the body based on the angle and the distance of the second pixel relative to the first pixel, and the first probability.
14. The system of claim 13, wherein the memory further bears instructions that, upon execution by the processor, cause the system at least to: determine that a third pixel in the depth image has a zero probability of corresponding to the body based on the third pixel having a depth value associated with a background.
15. A computer-readable storage medium bearing computer-executable instructions that, when executed on a computer, cause the computer to perform operations comprising: receiving a depth image that includes at least part of a body; determining that a first pixel in the depth image corresponds to the body; and determining that a second pixel in the depth image corresponds to the body based on an angle and a distance of the second pixel relative to the first pixel.
16. The computer-readable storage medium of claim 15, wherein determining that the first pixel in the depth image corresponds to the body comprises determining a first probability that the first pixel in the depth image corresponds to the body, and wherein determining that the second pixel in the depth image corresponds to the body comprises: determining that the second pixel in the depth image corresponds to the body based on the angle and the distance of the second pixel relative to the first pixel, and the first probability.
17. The computer-readable storage medium of claim 16, further bearing computer-executable instructions that, when executed on the computer, cause the computer to perform operations comprising: determining that a third pixel in the depth image has a zero probability of corresponding to the body based on the third pixel having a depth value associated with a background.
18. The computer-readable storage medium of claim 16, further bearing computer-executable instructions that, when executed on the computer, cause the computer to perform operations comprising: selecting a third pixel among a plurality of pixels in the depth image for determination of whether the third pixel corresponds to the body based on determining that the second pixel in the depth image corresponds to the body, and based on an angle and a distance of the third pixel relative to the second pixel.
19. The computer-readable storage medium of claim 16, further bearing computer-executable instructions that, when executed on the computer, cause the computer to perform operations comprising: selecting the second pixel among a plurality of pixels in the depth image for determination of whether the second pixel corresponds to the body based on determining that the angle and distance of the second pixel relative to the first pixel is below a threshold amount.
20. The computer-readable storage medium of claim 19, wherein the threshold amount would increase if a depth value associated with the first pixel were to decrease.